Week 4: Clustering and classification

date()
## [1] "Mon Nov 16 22:40:47 2020"

Task 1

As you can see, this file have been made and uploaded to GitHub.

# In order to keep the Rmd file clean, all the libraries are downloaded here. Keep scrolling.
library(MASS)
library(corrplot)
## corrplot 0.84 loaded
library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.3.0 --
## v ggplot2 3.3.2     v purrr   0.3.4
## v tibble  3.0.4     v dplyr   1.0.2
## v tidyr   1.1.2     v stringr 1.4.0
## v readr   1.4.0     v forcats 0.5.0
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
## x dplyr::select() masks MASS::select()
library(plotly)
## 
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## The following object is masked from 'package:MASS':
## 
##     select
## The following object is masked from 'package:stats':
## 
##     filter
## The following object is masked from 'package:graphics':
## 
##     layout

Task 2

The data to be analyze this week is called Boston. It comes from MASS package. The data is rather classic set of information. It consist data related to housing values in suburbs of Boston and several other parameters including air quality, crime rate and proportion of people of color at those suburbs. The dataset is rather old. It is first published in 1978.

data("Boston")

The str() method is called for checking the basic structure of the data. The data has 506 rows and 14 columns. The variable names and types are listed here:

str(Boston)
## 'data.frame':    506 obs. of  14 variables:
##  $ crim   : num  0.00632 0.02731 0.02729 0.03237 0.06905 ...
##  $ zn     : num  18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
##  $ indus  : num  2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
##  $ chas   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ nox    : num  0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
##  $ rm     : num  6.58 6.42 7.18 7 7.15 ...
##  $ age    : num  65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
##  $ dis    : num  4.09 4.97 4.97 6.06 6.06 ...
##  $ rad    : int  1 2 2 3 3 3 5 5 5 5 ...
##  $ tax    : num  296 242 242 222 222 222 311 311 311 311 ...
##  $ ptratio: num  15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
##  $ black  : num  397 397 393 395 397 ...
##  $ lstat  : num  4.98 9.14 4.03 2.94 5.33 ...
##  $ medv   : num  24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...

Task 3

Here is the pairs-plot of the dataset. Since the data has quite many columns, it practically shows you nothing.

pairs(Boston)

It is better to use corrplot()-method for checking the data set. The tidyverse and corrplot packages were downloaded for this.

The correlations between different parameters are shown here. The larger and darker the circle is, the stronger the correlation is. Blue represents positive and red negative correlation.

cor_matrix<-cor(Boston) %>% round(digits=2)

corrplot(cor_matrix, method="circle", type = "upper", cl.pos = "b", tl.pos = "d", tl.cex = 0.6)

The histograms for each variable can be drawn using ggplot2. Traditionally, the data set is used for demonstrating effects of different parameters to housing values.

ggplot(gather(Boston), aes(value)) + 
    geom_histogram(bins = 10) + 
    facet_wrap(~key, scales = 'free_x')

Task 4

In this task, the data set “Boston” is standardized and crime rate is switched to categorical variable.

As one can see, the mean value for each parameter is zero after scaling.

## First, the data set is scaled and summary is printed
boston_scaled <- scale(Boston)
boston_scaled=as.data.frame(boston_scaled)

summary(boston_scaled)
##       crim                 zn               indus              chas        
##  Min.   :-0.419367   Min.   :-0.48724   Min.   :-1.5563   Min.   :-0.2723  
##  1st Qu.:-0.410563   1st Qu.:-0.48724   1st Qu.:-0.8668   1st Qu.:-0.2723  
##  Median :-0.390280   Median :-0.48724   Median :-0.2109   Median :-0.2723  
##  Mean   : 0.000000   Mean   : 0.00000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.007389   3rd Qu.: 0.04872   3rd Qu.: 1.0150   3rd Qu.:-0.2723  
##  Max.   : 9.924110   Max.   : 3.80047   Max.   : 2.4202   Max.   : 3.6648  
##       nox                rm               age               dis         
##  Min.   :-1.4644   Min.   :-3.8764   Min.   :-2.3331   Min.   :-1.2658  
##  1st Qu.:-0.9121   1st Qu.:-0.5681   1st Qu.:-0.8366   1st Qu.:-0.8049  
##  Median :-0.1441   Median :-0.1084   Median : 0.3171   Median :-0.2790  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.5981   3rd Qu.: 0.4823   3rd Qu.: 0.9059   3rd Qu.: 0.6617  
##  Max.   : 2.7296   Max.   : 3.5515   Max.   : 1.1164   Max.   : 3.9566  
##       rad               tax             ptratio            black        
##  Min.   :-0.9819   Min.   :-1.3127   Min.   :-2.7047   Min.   :-3.9033  
##  1st Qu.:-0.6373   1st Qu.:-0.7668   1st Qu.:-0.4876   1st Qu.: 0.2049  
##  Median :-0.5225   Median :-0.4642   Median : 0.2746   Median : 0.3808  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 1.6596   3rd Qu.: 1.5294   3rd Qu.: 0.8058   3rd Qu.: 0.4332  
##  Max.   : 1.6596   Max.   : 1.7964   Max.   : 1.6372   Max.   : 0.4406  
##      lstat              medv        
##  Min.   :-1.5296   Min.   :-1.9063  
##  1st Qu.:-0.7986   1st Qu.:-0.5989  
##  Median :-0.1811   Median :-0.1449  
##  Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.6024   3rd Qu.: 0.2683  
##  Max.   : 3.5453   Max.   : 2.9865
# Then the crime rate is switched to categorical variable.
bins <- quantile(boston_scaled$crim)
crime <- cut(boston_scaled$crim, breaks = bins, include.lowest = TRUE, label=c("low", "med_low", "med_high", "high"))
boston_scaled <- dplyr::select(boston_scaled, -crim)
boston_scaled <- data.frame(boston_scaled, crime)

The final part of this task is to split the scaled data set into test and train data using 80-20 split.

# number of rows in the Boston dataset 
n <- nrow(boston_scaled)
# choose randomly 80% of the rows
ind <- sample(n,  size = n * 0.8)
# create train set
train <- boston_scaled[ind,]
# create test set 
test <- boston_scaled[-ind,]

After the model is trained, the results it gives to the test set is less-biased estimator for the actual model accuracy.

Task 5

First, the LDA-model is fitted to the data set. This time, the catergorical crime rate is used as target variable.

# These simple models can be applied using a single line of code.
lda.fit <- lda(crime ~ ., data = train)

A new function “lda-arrows” is defined for the task. The function draws the LDA bi-plot.

lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
  heads <- coef(x)
  arrows(x0 = 0, y0 = 0, 
         x1 = myscale * heads[,choices[1]], 
         y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
  text(myscale * heads[,choices], labels = row.names(heads), 
       cex = tex, col=color, pos=3)
}

# target classes as numeric
classes <- as.numeric(train$crime)

# plot the lda results
plot(lda.fit, dimen = 2, col = classes, pch = classes)
lda.arrows(lda.fit, myscale = 1)

One can see, that is actually relatively easy to separate the high-crime rate areas

Task 6

First, the correct classes are separated from the test set

# save the correct classes from test data
correct_classes <- test$crime
# remove the crime variable from test data
test <- dplyr::select(test, -crime)

The rest of the test data is fed into the trained LDA model. The cross tabulated results are printed.

# predict classes with test data
lda.pred <- predict(lda.fit, newdata = test)

# cross tabulate the results
table(correct = correct_classes, predicted = lda.pred$class)
##           predicted
## correct    low med_low med_high high
##   low       14      14        2    0
##   med_low    0      12        9    0
##   med_high   0       7       16    0
##   high       0       0        0   28

As the results show, the model can predict the high crime rate suburbs with nearly 100% accuracy. However, the model truly struggles when it tries to decide, should a suburb belong to low or med_low category. In addition, there is some problems to find the difference between med_low and med_high suburbs.

Task 7

First, let’s reload and standardize the Boston data set

data("Boston")
boston_scaled <- scale(Boston)
boston_scaled=as.data.frame(boston_scaled)

Second, let’s build a k-mean clustering algorithm with the scaled boston data.

Let’s select 2 as the number of clusters. Since 14x14 pair plot is bit difficult to look at. Let’s draw three smaller ones.

k_max <- 10

# calculate the total within sum of squares
twcss <- sapply(1:k_max, function(k){kmeans(boston_scaled, k)$tot.withinss})

# visualize the results
qplot(x = 1:k_max, y = twcss, geom = 'line')

# k-means clustering
km <-kmeans(boston_scaled, centers = 2)

# plot the Boston dataset with clusters
pairs(boston_scaled[1:5], col = km$cluster)

pairs(boston_scaled[5:10], col = km$cluster)

pairs(boston_scaled[10:14], col = km$cluster)

The pair plots shows nice separation of two groups. The results are quite simillar as with the LDA algorithm. With LDA we saw that it is relatively easy to find the suburbs with high crime rate. But there was significant overlapping with suburbs that had low, med_low and med_high crime rate.

Bonus

First, let’s perform the k-means clustering using 3 centers.

km2 <-kmeans(boston_scaled, centers = 3)

Then, let’s try to fit the LDA to the new k-means model output.

# These simple models can be applied using a single line of code.
lda.fit2 <- lda(km2$cluster ~ ., data = boston_scaled)

Now, let’s look the results. It seems that the model can predict detect the suburbs with high crime rate using this unsupervised learning method.

lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
  heads <- coef(x)
  arrows(x0 = 0, y0 = 0, 
         x1 = myscale * heads[,choices[1]], 
         y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
  text(myscale * heads[,choices], labels = row.names(heads), 
       cex = tex, col=color, pos=3)
}

# target classes as numeric
classes <- as.numeric(km2$cluster)

# plot the lda results
plot(lda.fit2, dimen = 2, col = classes, pch = classes)
lda.arrows(lda.fit, myscale = 1)

Super bonus

First part of the super bonus task

model_predictors <- dplyr::select(train, -crime)
# check the dimensions
dim(model_predictors)
## [1] 404  13
dim(lda.fit$scaling)
## [1] 13  3
# matrix multiplication
matrix_product <- as.matrix(model_predictors) %*% lda.fit$scaling
matrix_product <- as.data.frame(matrix_product)


plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type= 'scatter3d', mode='markers')
## Warning: `arrange_()` is deprecated as of dplyr 0.7.0.
## Please use `arrange()` instead.
## See vignette('programming') for more help
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_warnings()` to see where this warning was generated.